SCAMPP+FastTree: improving scalability for likelihood-based phylogenetic placement

Abstract Summary Phylogenetic placement is the problem of placing ‘query’ sequences into an existing tree (called a ‘backbone tree’). One of the most accurate phylogenetic placement methods to date is the maximum likelihood-based method pplacer, using RAxML to estimate numeric parameters on the backbone tree and then adding the given query sequence to the edge that maximizes the probability that the resulting tree generates the query sequence. Unfortunately, this way of running pplacer fails to return valid outputs on many moderately large backbone trees and so is limited to backbone trees with at most ∼10 000 leaves. SCAMPP is a technique to enable pplacer to run on larger backbone trees, which operates by finding a small ‘placement subtree’ specific to each query sequence, within which the query sequence are placed using pplacer. That approach matched the scalability and accuracy of APPLES-2, the previous most scalable method. Here, we explore a different aspect of pplacer’s strategy: the technique used to estimate numeric parameters on the backbone tree. We confirm anecdotal evidence that using FastTree instead of RAxML to estimate numeric parameters on the backbone tree enables pplacer to scale to much larger backbone trees, almost (but not quite) matching the scalability of APPLES-2 and pplacer-SCAMPP. We then evaluate the combination of these two techniques—SCAMPP and the use of FastTree. We show that this combined approach, pplacer-SCAMPP-FastTree, has the same scalability as APPLES-2, improves on the scalability of pplacer-FastTree and achieves better accuracy than the comparably scalable methods. Availability and implementation https://github.com/gillichu/PLUSplacer-taxtastic. Supplementary information Supplementary data are available at Bioinformatics Advances online.

1 Additional Results replicates) containing 1000 sequences. The top row shows average delta error, bottom row shows runtime (each per query). The "X" indicates that pplacer-RAxML failed with an out of memory error on these experiments. Peak memory usage on the ROSE datasets is negligible (< 1GB).

Overall results
nt78 RNASim ROSE 1000-Seq nt78 (full) nt78 (0.25) nt78 (0.10) 50k (full) 50k (0.25) 100k (full) 100k (0. 25  We provide a summary of the per-query delta error (top) and runtime (bottom) results shown across the methods on the simulated datasets with large backbone trees (we omit the ROSE datasets because they are too small). We show the query lengths in the row below the dataset. 'X' indicates the method failed to run on this dataset. 'X*' indicates these methods were not run but were expected to fail from previous studies. For instance, RNASim-50k and pplacer-RAxML were not run because they have already been shown to fail [7]. The lowest delta error for each dataset column (including all ties within 0.1) is in boldface. '-' indicates this method was not run on the corresponding dataset.        Figure S7: A histogram of delta error distribution, on the nt78 simulated dataset, evaluated for fulllength query sequences.

Delta Error
Suppose T is the backbone tree used on leafset L. Let q be a single query sequence and let T * be the true tree on L ∪ {q}. Let P be the tree produced when the query q is added into the backbone tree T using the phylogenetic placement pipeline. For tree A and set B a subset of the leaves of A, we denote the subtree of A induced by leafset B by A| B . Now for an arbitrary tree t (on whatever leafset), let B(t) denote the set of bipartitions (where each edge in t defines a bipartition on its leafset). Then the delta error is the increase in the number of false negatives produced by adding q into T . Hence, The scripts to calculate delta error were adapted from published scripts [1], and also use utilities from newick_utils.

Numeric Parameters
Numeric parameters are estimated on a tree topology under the Generalized Time Reversible (GTR) model with gamma-distributed rates-across sites. These numeric parameters include branch lengths, the substitution rate matrix (e.g., the 4x4 Generalized Time Reversible), and gamma distribution for rates-across-sites. Each placement method specifies a preferred software for producing these estimated parameters. In this paper we used existing numeric parameters where possible and estimated using RAxML and FastTree if necessary. We provide relevant commands below.
For pplacer-SCAMPP-FastTree, we use published FastTree-2-estimated numeric parameters (e.g., branch lengths, substitution rate matrix and gamma distribution) when possible; here there are no published numeric parameters we compute all three numeric parameters using FastTree-2 with the following command: • FastTreeMP -nosupport -gtr -gamma -nt -log reestimated.fasttree.log -intree intree.fasta < alignment.fasta > reestimated.fasttree.tree For pplacer-SCAMPP-RAxML, we use published numeric parameters (e.g., branch lengths, substitution rate matrix and gamma distribution) [7] for all datasets other than the ROSE datasets, which had not been previously evaluated in this context. We provide the RAxML command used to estimate all these numeric parameters on the ROSE datasets below: • RAxML-7.2.8-ALPHA/raxmlHPC-PTHREADS -f e -g reference.nwk -m GTRGAMMA -s alignment.phylip -n outname -p 1984 -T 16 -w outdir/ For all other datasets, RAxML 7 was used to estimate all three numeric parameters for pplacer-SCAMPP-RAxML, with one exception.
On the RNASim dataset, we used published FastTree-2-estimated branch lengths [7], and RAxML 7-estimated substitution rate matrix and gamma distribution. We use the published substitution rate matrix and gamma distribution on the RNASim-200k dataset [1] for all smaller subsets of the RNASim dataset, as all subsets of the RNASim dataset come from the same million-sequence tree so we expect the substitution rate matrix and gamma distribution estimated on the RNASim-100k dataset to be quite similar to the numeric parameters for RNASim-200k.

pplacer-FastTree
We used estimated branch lengths and all other numeric parameters from FastTree-2.
To run pplacer with FastTree numeric parameters, we can use the taxtastic software to reformat the FastTree info files. This builds a "reference package" which consists of a backbone alignment, corresponding phylogeny and estimated numeric parameters on that phylogeny.
• taxtastic/taxtastic-env/bin/taxit create -P ng-my.refpkg -l name --aln-fast ref.fa --tree-file backbone_pp.tree --tree-stats reestimated.fasttree.log The new version of pplacer-FastTree (using the speedup in placing fragmentary query sequences implemented on May 16, 2022) was run for results on RNASim-100k and RNASim-200k. For all other datasets and results on fragmentary query sequences, the version of pplacer-FastTree not implementing this speedup on fragmentary query sequences was used.

EPA-ng
We estimated branch lengths and all other numeric parameters using RAxML 8.

APPLES-2
We estimated branch lengths and all other numeric parameters using FastTree-2 under minimum evolution. Please note that before running APPLES-2 we converted all U in our sequences to T , as APPLES-2 only understands the ACTG alphabet.

Comparison of RAxML and FastTree 2 for Numeric Parameter Estimation
There are many differences between FastTree and RAxML's parameter estimation process. Our commands to both FastTree and RAxML specify to estimate numeric parameters under the discrete gamma model (specified with -gamma). By default, FastTree estimates the numeric parameters with 20 discrete rate categories with a fixed rate for each site (also known as the CAT20 approximation). In contrast, by default RAxML always uses 4 rate categories to approximate the gamma distribution of rate heterogeneity, as seen in the documentation here: https://cme.hits.org/exelixis/resource/download/NewManual.pdf on page 11. Our study suggests that pplacer using FastTree 2 numeric parameters can scale to larger datasets and so has fewer numerical problems, compared to pplacer using RAxML. More work is needed to understand why this is true.